Collaborating Authors

 Norco


SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Haas, Lukas, Yona, Gal, D'Antonio, Giovanni, Goldshtein, Sasha, Das, Dipanjan

arXiv.org Artificial Intelligence

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
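The abstract reports an F1-score but does not define it. The sketch below assumes the convention of the original SimpleQA benchmark, where the score is the harmonic mean of overall accuracy and accuracy on attempted questions. The function name and the three grade counts are hypothetical and are not taken from the released evaluation code.

```python
def simpleqa_style_f1(correct: int, incorrect: int, not_attempted: int) -> float:
    """Harmonic mean of overall accuracy and accuracy on attempted questions.

    Assumption: this mirrors the original SimpleQA scoring convention; the
    abstract itself does not spell out the formula.
    """
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    if total == 0 or attempted == 0 or correct == 0:
        return 0.0
    overall_correct = correct / total            # abstentions count against the model
    correct_given_attempted = correct / attempted  # abstentions are ignored
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Made-up counts for illustration only (not results from the paper).
score = simpleqa_style_f1(correct=400, incorrect=400, not_attempted=200)
```

Under this definition a model cannot raise its score by guessing on everything or by abstaining on everything, since each strategy lowers one of the two terms.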


Do Bayesian Neural Networks Improve Weapon System Predictive Maintenance?

Potter, Michael, Jun, Miru

arXiv.org Artificial Intelligence

This approach lacks the extra information on individual weapon system characteristics. A recent method introduced the Weibull-Cox Bayesian Neural Network tested on several weapon systems, albeit requiring a held-out validation set [7]. Moreover, while understanding the population reliability trends via a Weibull distribution is informative, this formulation does ...

... systems with interval-censored data and time-varying covariates. We analyze and benchmark our approach, LaplaceNN, on synthetic and real datasets with standard classification metrics such as Receiver Operating Characteristic (ROC) Area Under Curve (AUC), Precision-Recall (PR) AUC, and reliability curve visualizations.


Bayesian Weapon System Reliability Modeling with Cox-Weibull Neural Network

Potter, Michael, Cheng, Benny

arXiv.org Artificial Intelligence

We propose to integrate weapon system features (such as weapon system manufacturer, deployment time and location, storage time and location, etc.) into a parameterized Cox-Weibull [1] reliability model via a neural network, like DeepSurv [2], to improve predictive maintenance. In parallel, we develop an alternative Bayesian model by parameterizing the Weibull parameters with a neural network and employing dropout methods such as Monte Carlo (MC) dropout for comparative purposes. Due to data collection procedures in weapon system testing, we employ a novel interval-censored log-likelihood which incorporates Markov chain Monte Carlo (MCMC) [3] sampling of the Weibull parameters during gradient descent optimization. We compare classification metrics such as receiver operating characteristic (ROC) area under the curve (AUC), precision-recall (PR) AUC, and F scores to show that our model generally outperforms traditional powerful models such as XGBoost and the current standard conditional Weibull probability density estimation model.
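To make the interval-censoring idea concrete, here is a minimal sketch of a Weibull log-likelihood with a Cox-style proportional-hazards term. It is not the authors' implementation: it leaves out the neural network and the MCMC sampling step, and the function names, the `log_risk` argument, and the interval layout are assumptions made for illustration. Each item is only known to have failed somewhere in `(left, right]`, so it contributes `S(left) - S(right)` to the likelihood, where `S(t | x) = exp(-(t / scale)**shape * exp(log_risk))`.

```python
import math


def weibull_survival(t, shape, scale, log_risk=0.0):
    """Survival probability under a Weibull baseline with a Cox-style risk term."""
    if math.isinf(t):
        return 0.0  # every item has failed by t = infinity
    cumulative_hazard = (t / scale) ** shape * math.exp(log_risk)
    return math.exp(-cumulative_hazard)


def interval_censored_loglik(intervals, shape, scale, log_risks=None):
    """Sum of log(S(left) - S(right)) over items observed only as intervals.

    intervals: list of (left, right) pairs; use right = math.inf for an item
               that had not failed at its last inspection (right-censored).
    log_risks: one log-risk per item (hypothetical stand-in for a network
               output); defaults to 0.0, i.e. the plain Weibull baseline.
    """
    if log_risks is None:
        log_risks = [0.0] * len(intervals)
    total = 0.0
    for (left, right), log_risk in zip(intervals, log_risks):
        prob = (weibull_survival(left, shape, scale, log_risk)
                - weibull_survival(right, shape, scale, log_risk))
        total += math.log(max(prob, 1e-300))  # guard against log(0)
    return total


# Made-up inspection windows, in arbitrary time units.
example = interval_censored_loglik(
    [(1.0, 2.0), (3.0, math.inf)], shape=1.5, scale=4.0, log_risks=[0.2, -0.1]
)
```

In a DeepSurv-style setup the per-item log-risk would be produced by a network from the system's features, and this quantity (negated) would serve as the training loss; here the values are plain numbers so the likelihood itself is easy to inspect.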